Data Set Information:
The data is related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014].
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).
The smaller datasets are provided for testing more computationally demanding machine learning algorithms (e.g., SVM).
The classification goal is to predict whether the client will subscribe to a term deposit (variable y: yes/no).
Citation
[1] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
[2] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimaraes, Portugal, October, 2011. EUROSIS. [bank.zip]
Task: Can you cluster the customers and show their clusters?
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('bank_marketing_dataset.csv')
mask = (df.dtypes == 'int64') | (df.dtypes == 'float64')
df_num = df.drop(df.columns[~mask], axis=1)  # drop the non-numeric columns (~ is the correct boolean negation; -mask raises a TypeError in modern pandas)
df_num.shape
(41188, 10)
First of all, we separate the numeric variables from the dataframe and see whether, using only these variables, we can draw meaningful results.
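A more concise, equivalent way to keep only the numeric columns is pandas' `select_dtypes`. A minimal sketch on a tiny toy frame (hypothetical columns, not the bank data):

```python
import pandas as pd

# Toy frame standing in for the bank data (hypothetical columns).
df = pd.DataFrame({
    'age': [30, 42, 55],
    'job': ['admin.', 'services', 'retired'],
    'duration': [120.0, 85.0, 310.0],
})

# Keep only numeric columns; equivalent to masking on int64/float64 dtypes.
df_num = df.select_dtypes(include='number')
print(list(df_num.columns))  # → ['age', 'duration']
```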
df_num.isna().sum()
age               0
duration          0
campaign          0
pdays             0
previous          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
dtype: int64
sns.pairplot(df_num)
plt.show()
df_num.describe()
| age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.00000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 |
| mean | 40.02406 | 258.285010 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.502600 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.570960 | 0.578840 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.00000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | -3.400000 | 92.201000 | -50.800000 | 0.634000 | 4963.600000 |
| 25% | 32.00000 | 102.000000 | 1.000000 | 999.000000 | 0.000000 | -1.800000 | 93.075000 | -42.700000 | 1.344000 | 5099.100000 |
| 50% | 38.00000 | 180.000000 | 2.000000 | 999.000000 | 0.000000 | 1.100000 | 93.749000 | -41.800000 | 4.857000 | 5191.000000 |
| 75% | 47.00000 | 319.000000 | 3.000000 | 999.000000 | 0.000000 | 1.400000 | 93.994000 | -36.400000 | 4.961000 | 5228.100000 |
| max | 98.00000 | 4918.000000 | 56.000000 | 999.000000 | 7.000000 | 1.400000 | 94.767000 | -26.900000 | 5.045000 | 5228.100000 |
scaler = StandardScaler()
data_scaled = scaler.fit_transform(df_num)
df_scaled = pd.DataFrame(data = data_scaled, columns=df_num.columns)
df_scaled
| age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.533034 | 0.010471 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 |
| 1 | 1.628993 | -0.421501 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 |
| 2 | -0.290186 | -0.124520 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 |
| 3 | -0.002309 | -0.413787 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 |
| 4 | 1.533034 | 0.187888 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41183 | 3.164336 | 0.292025 | -0.565922 | 0.195414 | -0.349494 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 |
| 41184 | 0.573445 | 0.481012 | -0.565922 | 0.195414 | -0.349494 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 |
| 41185 | 1.533034 | -0.267225 | -0.204909 | 0.195414 | -0.349494 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 |
| 41186 | 0.381527 | 0.708569 | -0.565922 | 0.195414 | -0.349494 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 |
| 41187 | 3.260295 | -0.074380 | 0.156105 | 0.195414 | 1.671136 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 |
41188 rows × 10 columns
df_sample = df_scaled.sample(n=5000)
df_sample
| age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | |
|---|---|---|---|---|---|---|---|---|---|---|
| 18036 | 0.381527 | -0.664485 | -0.565922 | 0.195414 | -0.349494 | 0.839061 | 0.591424 | -0.474791 | 0.773575 | 0.845170 |
| 21978 | -0.769980 | -0.213228 | -0.204909 | 0.195414 | -0.349494 | 0.839061 | -0.227465 | 0.951267 | 0.774152 | 0.845170 |
| 24208 | 0.093650 | -0.398359 | -0.565922 | 0.195414 | 1.671136 | -0.115781 | -0.649003 | -0.323542 | 0.328471 | 0.398115 |
| 22966 | 1.820911 | -0.348220 | 0.156105 | 0.195414 | -0.349494 | 0.839061 | -0.227465 | 0.951267 | 0.774728 | 0.845170 |
| 20338 | 0.285568 | -0.606631 | -0.565922 | 0.195414 | -0.349494 | 0.839061 | -0.227465 | 0.951267 | 0.775305 | 0.845170 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23333 | 1.053240 | 1.121256 | -0.204909 | 0.195414 | -0.349494 | 0.839061 | -0.227465 | 0.951267 | 0.774152 | 0.845170 |
| 29397 | -0.769980 | -0.961465 | 0.878132 | 0.195414 | -0.349494 | -1.197935 | -0.864955 | -1.425496 | -1.277824 | -0.940281 |
| 24479 | 1.628993 | -0.348220 | -0.204909 | 0.195414 | -0.349494 | -0.115781 | -0.649003 | -0.323542 | 0.328471 | 0.398115 |
| 31242 | 0.573445 | -0.251797 | -0.565922 | 0.195414 | -0.349494 | -1.197935 | -1.179380 | -1.231034 | -1.318759 | -0.940281 |
| 6819 | -1.057857 | -0.811047 | 0.878132 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 |
5000 rows × 10 columns
for perp in [30, 40, 50]:
    for rs in [2, 3]:
        tsne = TSNE(n_components=2, perplexity=perp, random_state=rs)
        data_tsne = tsne.fit_transform(df_sample)
        df_tsne = pd.DataFrame(data_tsne, columns=['x', 'y'])
        sns.scatterplot(x='x', y='y', data=df_tsne)
        plt.title('t-SNE Plot with Perplexity Value %s and Random State %s' % (perp, rs))
        plt.show()
Apparently the scales of the variables differ vastly, hence the drastically different variances. We should therefore standard-scale the data.
df_num.var()
age               108.602451
duration        67225.728877
campaign            7.672975
pdays           34935.687284
previous            0.244927
emp.var.rate        2.467915
cons.price.idx      0.335056
cons.conf.idx      21.420215
euribor3m           3.008308
nr.employed      5220.283250
dtype: float64
The variances of the attributes differ greatly, which confirms that scaling the dataset was the right call.
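As a sanity check on what standard scaling does, the sketch below (toy data, not the bank set) confirms that `StandardScaler` is exactly the column-wise z-score (x - mean) / std, using the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaled = StandardScaler().fit_transform(X)
# Population std (ddof=0), which is what StandardScaler uses internally.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

assert np.allclose(scaled, manual)
# After scaling, every column has mean 0 and unit variance.
assert np.allclose(scaled.mean(axis=0), 0)
assert np.allclose(scaled.std(axis=0), 1)
```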
from pyclustertend import hopkins
num_trials = 5
hopkins_stats = []
n = len(df_scaled)
p = int(0.1 * n)
for i in range(num_trials):
    hopkins_stats.append(hopkins(df_scaled, p))
print(hopkins_stats)
[0.010848178078693087, 0.011174801195989502, 0.0115420114785353, 0.01119288771509344, 0.011308594962426878]
The Hopkins statistics are close to 0, which in pyclustertend's convention indicates that the data is highly clusterable (pyclustertend reports 1 minus the classical statistic, so values near 0.5 would indicate uniform, unclusterable data).
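For reference, here is a minimal re-implementation of a pyclustertend-style Hopkins statistic (a sketch; the library's exact sampling details may differ). Under this convention, values near 0 indicate cluster structure, and values near 0.5 indicate a roughly uniform distribution:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_stat(X, sample_frac=0.1, seed=0):
    """Sketch of a pyclustertend-style Hopkins statistic (hypothetical helper):
    ~0 for clustered data, ~0.5 for uniformly distributed data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    n, d = X.shape
    m = max(1, int(sample_frac * n))
    # m points drawn uniformly from the bounding box of X
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    # m points sampled from X itself (without replacement)
    sample = X[rng.choice(n, size=m, replace=False)]
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()  # uniform point -> nearest real point
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]     # real point -> nearest *other* real point
    return w.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(42)
blobs = np.vstack([rng.normal(0, 0.1, (250, 2)), rng.normal(10, 0.1, (250, 2))])
uniform = rng.uniform(0, 1, (500, 2))
print(hopkins_stat(blobs))    # close to 0: strongly clustered
print(hopkins_stat(uniform))  # close to 0.5: no cluster structure
```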
The t-SNE plots suggest that there are 8 to 12 clusters in the dataset, which look convex but are not balanced in size. The clusters also do not separate well in general and appear to overlap.
We use only the numerical attributes for DBSCAN. Since the dataset appears to have a convex but size-imbalanced clustering structure, DBSCAN, which makes no assumption about cluster sizes and explicitly models noise, is a reasonable way to evaluate that structure.
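As a quick reminder of DBSCAN's conventions before applying it at scale: points that belong to no dense region are labeled -1 (noise), which the analysis below relies on. A toy sketch (synthetic data, not the bank set):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense blobs plus one far-away outlier
X = np.vstack([rng.normal(0, 0.1, (50, 2)),
               rng.normal(5, 0.1, (50, 2)),
               [[100.0, 100.0]]])
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(sorted(set(labels.tolist())))  # → [-1, 0, 1]: two clusters plus the noise label
```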
from sklearn.neighbors import NearestNeighbors
nearest_neighbors = NearestNeighbors(n_neighbors=11)
neighbors = nearest_neighbors.fit(df_scaled)
distances, indices = neighbors.kneighbors(df_scaled)
distances = np.sort(distances[:,10], axis=0)
fig = plt.figure(figsize=(5, 5))
plt.plot(distances)
plt.xlabel("Points")
plt.ylabel("Distance")
Text(0, 0.5, 'Distance')
from kneed import KneeLocator
i = np.arange(len(distances))
knee = KneeLocator(i, distances, S=1, curve='convex', direction='increasing', interp_method='polynomial')
fig = plt.figure(figsize=(5, 5))
knee.plot_knee()
plt.xlabel("Points")
plt.ylabel("Distance")
print(distances[knee.knee])
0.6027031883556995
<Figure size 360x360 with 0 Axes>
Reference for the previous analysis
The analysis above demonstrates how to systematically locate the elbow point, reducing the computational burden and narrowing down the parameter range.
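The knee idea can also be sketched without the kneed dependency: pick the point of maximum perpendicular distance from the chord joining the curve's endpoints, which is the core of the Kneedle heuristic. A minimal numpy-only version on a synthetic k-distance-style curve (a sketch, not KneeLocator's exact algorithm):

```python
import numpy as np

def knee_index(y):
    """Index of the point farthest from the straight line joining the curve's
    endpoints -- a simple stand-in for kneed's KneeLocator (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Unit vector along the chord from first to last point
    chord = np.array([x[-1] - x[0], y[-1] - y[0]])
    chord = chord / np.linalg.norm(chord)
    # Perpendicular distance of each point to the chord
    vec = np.column_stack([x - x[0], y - y[0]])
    proj = vec @ chord
    perp = vec - np.outer(proj, chord)
    return int(np.argmax(np.linalg.norm(perp, axis=1)))

# Synthetic sorted k-distance curve: nearly flat, then a sharp rise near index 80.
d = np.concatenate([np.linspace(0.1, 0.5, 80), np.linspace(0.5, 5.0, 20)])
print(knee_index(d))  # the elbow is found near index 80
```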
# A rough sweep over minpts to narrow down the parameter range
c_b_score_list = []
n_cluster_list = []
n_noisy = []
for minpts in range(2, 120):
    # Estimate epsilon for this minpts from the knee of the k-distance curve
    nearest_neighbors = NearestNeighbors(n_neighbors=minpts + 1)
    neighbors = nearest_neighbors.fit(df_scaled)
    distances, indices = neighbors.kneighbors(df_scaled)
    distances = np.sort(distances[:, minpts], axis=0)
    i = np.arange(len(distances))
    knee = KneeLocator(i, distances, S=1, curve='convex', direction='increasing', interp_method='polynomial')
    eps = distances[knee.knee]
    # Cluster with DBSCAN at (eps, minpts)
    dst = DBSCAN(eps=eps, min_samples=minpts, metric='euclidean', n_jobs=-1)
    df['predicted_cluster'] = dst.fit_predict(df_scaled)
    k = len(df.predicted_cluster.value_counts()) - 1
    n_cluster_list.append(k)
    n_noisy.append(len(df_scaled.loc[df['predicted_cluster'] == -1]))
    # The Calinski-Harabasz score is undefined for fewer than two clusters
    if k > 1:
        c_b_score = calinski_harabasz_score(df_scaled.loc[df.predicted_cluster != -1],
                                            df.predicted_cluster[df.predicted_cluster != -1])
    else:
        c_b_score = 0
    c_b_score_list.append(c_b_score)
    print('when minpts is {}'.format(minpts), 'epsilon is', eps, ', and Calinski-Harabasz score is', c_b_score)
when minpts is 2 epsilon is 0.36640658346965 , and Calinski-Harabasz score is 153.7076161515394
when minpts is 3 epsilon is 0.40885757259494926 , and Calinski-Harabasz score is 412.16316710792836
when minpts is 4 epsilon is 0.4463753554565966 , and Calinski-Harabasz score is 614.507285394908
...
when minpts is 108 epsilon is 1.1971641987035928 , and Calinski-Harabasz score is 8802.740462672376
...
when minpts is 119 epsilon is 1.2192695800884932 , and Calinski-Harabasz score is 8347.232825105817
fig, ax = plt.subplots(1, 3, figsize=(15, 5), sharex=True)
ax[0].plot(range(2,120), c_b_score_list)
ax[1].plot(range(2,120), n_cluster_list)
ax[2].plot(range(2,120), n_noisy)
ax[0].set_title('Calinski-Harabasz score')
ax[1].set_title('Number of clusters')
ax[2].set_title('Number of noise points')
ax[0].set_xlabel('minpts')
ax[1].set_xlabel('minpts')
ax[2].set_xlabel('minpts')
Text(0.5, 0, 'minpts')
minpts_list = np.arange(2, 120)
# Convert to an array first: comparing a plain list to a scalar with == yields a single bool
minpts_list[np.array(c_b_score_list) == max(c_b_score_list)]
array([108])
From the analysis above, we see that setting minpts=108 gives the highest Calinski-Harabasz score. The numbers of clusters and noise points also seem reasonable, so we will fine-tune the parameters from there.
# Perform the preliminary analysis
minpts = 108
calinski_harabazs_scores = []
silhouette_scores = []
num_clusters = []
num_noise_points = []
for eps in np.arange(1.15, 1.25, 0.01):
    # Cluster the dataset using DBSCAN
    dst = DBSCAN(eps=eps, min_samples=minpts, metric='euclidean', n_jobs=-1)
    df['predicted_cluster'] = dst.fit_predict(df_scaled)
    # Number of clusters in the clustering (excluding the noise label -1)
    k = len(df['predicted_cluster'].value_counts()) - 1
    num_clusters.append(k)
    # Number of noise points
    noise_point_num = len(df[df['predicted_cluster'] == -1])
    num_noise_points.append(noise_point_num)
    # Both scores are undefined for fewer than two clusters, so record 0 in that case
    non_noise = df['predicted_cluster'] != -1
    if k > 1:
        silhouette_scores.append(silhouette_score(df_scaled.loc[non_noise],
                                                  df['predicted_cluster'][non_noise]))
        calinski_harabazs_scores.append(calinski_harabasz_score(df_scaled.loc[non_noise],
                                                                df['predicted_cluster'][non_noise]))
    else:
        silhouette_scores.append(0)
        calinski_harabazs_scores.append(0)
# Show the results of the preliminary analysis
fig, ax = plt.subplots(1, 4, figsize=(15, 5), sharex=True)
eps = np.arange(1.15, 1.25, 0.01)
fig.suptitle('Minpts=%s'%minpts)
ax[0].plot(eps, calinski_harabazs_scores)
ax[1].plot(eps, silhouette_scores)
ax[2].plot(eps, num_noise_points)
ax[3].plot(eps, num_clusters)
ax[0].set_title('Calinski-Harabasz score')
ax[1].set_title('Average Silhouette score')
ax[2].set_title('Number of noise points')
ax[3].set_title('Number of clusters')
#ax[0].set_yticks(np.arange(5, 60, 7))
ax[0].set_xlabel('epsilon')
ax[1].set_xlabel('epsilon')
ax[2].set_xlabel('epsilon')
ax[3].set_xlabel('epsilon')
plt.show()
From the fine-tuning section, we see that for minpts=108, $\epsilon = 1.15$ gives the highest Calinski-Harabasz score and a relatively high average silhouette score, but also a relatively high number of noise points. $\epsilon = 1.22$ gives the highest average silhouette score with a lower Calinski-Harabasz score, and far fewer noise points than $\epsilon = 1.15$. So we pick $\epsilon = 1.22$ with minpts=108.
dst = DBSCAN(eps=1.22, min_samples=108, metric='euclidean', n_jobs=-1)
df['predicted_cluster'] = dst.fit_predict(df_scaled)
df['predicted_cluster'].value_counts()
 0    26303
 2     5882
-1     3899
 3     2148
 6     1149
 1      677
 5      503
 8      408
 7      110
 4      109
Name: predicted_cluster, dtype: int64
tsne = TSNE(n_components=2, perplexity=40, random_state=3)
data_tsne = tsne.fit_transform(df_scaled)
df_tsne = pd.DataFrame(data_tsne, columns=['x', 'y'])
df_tsne['predicted_cluster'] = df['predicted_cluster']
k=len(df_tsne['predicted_cluster'].value_counts())
sns.scatterplot(x='x',y='y', hue='predicted_cluster',
palette=sns.color_palette("husl", k-1), data=df_tsne[df_tsne['predicted_cluster']>=0])
sns.scatterplot(x='x',y='y',
color='black', data=df_tsne[df_tsne['predicted_cluster']==-1])
plt.title('DBSCAN with epsilon=%s and minpts=%s' %(1.22,108))
plt.legend(bbox_to_anchor=(1,1))
plt.show()
tsne = TSNE(n_components=2, perplexity=50, random_state=77)
data_tsne = tsne.fit_transform(df_scaled)
df_tsne = pd.DataFrame(data_tsne, columns=['x', 'y'])
df_tsne['predicted_cluster'] = df['predicted_cluster']
k=len(df_tsne['predicted_cluster'].value_counts())
sns.scatterplot(x='x',y='y', hue='predicted_cluster',
palette=sns.color_palette("husl", k-1), data=df_tsne[df_tsne['predicted_cluster']>=0])
sns.scatterplot(x='x',y='y',
color='black', data=df_tsne[df_tsne['predicted_cluster']==-1])
plt.title('DBSCAN with epsilon=%s and minpts=%s' %(1.22,108))
plt.legend(bbox_to_anchor=(1,1))
plt.show()
from sklearn.metrics import adjusted_rand_score, silhouette_samples, silhouette_score
def show_silhouette_plots(X, cluster_labels):
    # This package allows us to use "color maps" in our visualizations
    import matplotlib.cm as cm
    # How many clusters are in the clustering?
    n_clusters = len(np.unique(cluster_labels))
    # Create a figure with a single axes for the silhouette plot
    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(18, 20)
    # The silhouette coefficient can range over [-1, 1], but in this example all
    # values lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 inserts blank space between the silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
    # The silhouette_score gives the average value over all samples, which
    # gives a perspective on the density and separation of the formed clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    # Compute the silhouette score for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for the next plot (10 for the blank gap)
        y_lower = y_upper + 10
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # Vertical line at the average silhouette score over all samples
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the y-axis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.show()
show_silhouette_plots(df_scaled.loc[df.predicted_cluster!=-1], df.predicted_cluster[df.predicted_cluster!=-1])
silhouette_score(df_scaled.loc[df.predicted_cluster!=-1], df.predicted_cluster[df.predicted_cluster!=-1])
For n_clusters = 9 The average silhouette_score is : 0.2883623592134597
0.2883623592134597
From the silhouette plot we can see that clusters 2 and 6 have particularly poor cohesion and separation, with most of their members falling below the average silhouette score. Meanwhile, the other major clusters seem fine in terms of cohesion and separation. Overall, the clustering is acceptable but not strongly separated.
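As a sanity check on how to read these numbers: silhouette scores near 1 indicate dense, well-separated clusters, while scores near 0 indicate overlapping ones. A toy comparison (synthetic blobs, not the bank data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two tight, well-separated blobs vs. two heavily overlapping ones
tight = np.vstack([rng.normal(0, 0.2, (200, 2)), rng.normal(8, 0.2, (200, 2))])
loose = np.vstack([rng.normal(0, 2.0, (200, 2)), rng.normal(1, 2.0, (200, 2))])

labels_tight = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tight)
labels_loose = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(loose)
s_tight = silhouette_score(tight, labels_tight)  # near 1: well-separated clusters
s_loose = silhouette_score(loose, labels_loose)  # much lower: clusters overlap
print(round(s_tight, 2), round(s_loose, 2))
```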
from scipy.spatial.distance import pdist, squareform
df_sort = df_scaled.copy()
df_sort['predicted_cluster']= df['predicted_cluster'].copy()
df_sort = df_sort.sample(8000)
df_sort=df_sort.sort_values(by=['predicted_cluster'])
df_sort=df_sort.drop(['predicted_cluster'], axis=1)
dist_mat = squareform(pdist(df_sort))
plt.pcolormesh(dist_mat)
plt.colorbar()
N = len(df_sort)
plt.xlim([0,N])
plt.ylim([0,N])
plt.show()
The pairwise distance matrix, sorted by cluster label, shows that the clusters are somewhat close to one another.
df_scaled_des = df_scaled.copy()
df_scaled_des['label']=df['predicted_cluster']
for i in range(0, 9):
    data = df_scaled_des[df_scaled_des['label'] == i]
    plt.figure(figsize=(15, 5), dpi=80)
    sns.boxplot(data=data, fliersize=20)
    plt.show()
from kmodes.kprototypes import KPrototypes
from gower import gower_matrix
# `mask` (defined earlier in the notebook) flags the numeric columns,
# so dropping them leaves the categorical ones.
df_cat = df.drop(df.columns[mask], axis=1)
df_go = pd.concat([df_scaled, df_cat], axis=1)
df_go
| | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | ... | marital | education | default | housing | loan | contact | month | day_of_week | poutcome | subscribed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.533034 | 0.010471 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 | ... | married | basic.4y | no | no | no | telephone | may | mon | nonexistent | no |
| 1 | 1.628993 | -0.421501 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 | ... | married | high.school | unknown | no | no | telephone | may | mon | nonexistent | no |
| 2 | -0.290186 | -0.124520 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 | ... | married | high.school | no | yes | no | telephone | may | mon | nonexistent | no |
| 3 | -0.002309 | -0.413787 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 | ... | married | basic.6y | no | no | no | telephone | may | mon | nonexistent | no |
| 4 | 1.533034 | 0.187888 | -0.565922 | 0.195414 | -0.349494 | 0.648092 | 0.722722 | 0.886447 | 0.712460 | 0.331680 | ... | married | high.school | no | no | yes | telephone | may | mon | nonexistent | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41183 | 3.164336 | 0.292025 | -0.565922 | 0.195414 | -0.349494 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 | ... | married | professional.course | no | yes | no | cellular | nov | fri | nonexistent | yes |
| 41184 | 0.573445 | 0.481012 | -0.565922 | 0.195414 | -0.349494 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 | ... | married | professional.course | no | no | no | cellular | nov | fri | nonexistent | no |
| 41185 | 1.533034 | -0.267225 | -0.204909 | 0.195414 | -0.349494 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 | ... | married | university.degree | no | yes | no | cellular | nov | fri | nonexistent | no |
| 41186 | 0.381527 | 0.708569 | -0.565922 | 0.195414 | -0.349494 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 | ... | married | professional.course | no | no | no | cellular | nov | fri | nonexistent | yes |
| 41187 | 3.260295 | -0.074380 | 0.156105 | 0.195414 | 1.671136 | -0.752343 | 2.058168 | -2.224953 | -1.495186 | -2.815697 | ... | married | professional.course | no | yes | no | cellular | nov | fri | failure | no |
41188 rows × 21 columns
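`mask` is defined earlier in the notebook; for readers without that context, an equivalent, self-contained way to split numeric from categorical columns is `select_dtypes`. A sketch on a toy frame (illustrative column values only):

```python
import pandas as pd

# Toy frame standing in for df; in the notebook, df holds all 41188 rows.
df_toy = pd.DataFrame({
    'age': [30, 45],
    'duration': [120.0, 300.0],
    'marital': ['married', 'single'],
    'housing': ['yes', 'no'],
})

# Numeric columns go to scaling; everything else is treated as categorical.
df_num = df_toy.select_dtypes(include='number')
df_cat = df_toy.select_dtypes(exclude='number')
print(list(df_num.columns), list(df_cat.columns))
```

This avoids maintaining a boolean mask by hand and stays correct if columns are reordered.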
df_go_sample = df_go.sample(n=5000)
dist_mat=gower_matrix(df_go_sample)
dist_mat
array([[0. , 0.33903095, 0.36610395, ..., 0.27379286, 0.28793812,
0.25567502],
[0.33903095, 0. , 0.49738896, ..., 0.40231004, 0.43257588,
0.38900474],
[0.36610395, 0.49738896, 0. , ..., 0.14635019, 0.33817676,
0.34457314],
...,
[0.27379286, 0.40231004, 0.14635019, ..., 0. , 0.29401538,
0.29346105],
[0.28793812, 0.43257588, 0.33817676, ..., 0.29401538, 0. ,
0.16196132],
[0.25567502, 0.38900474, 0.34457314, ..., 0.29346105, 0.16196132,
0. ]], dtype=float32)
for rs in [66, 77]:
    for perp in [5, 10, 20, 30, 40, 50, 60]:
        # init='random' is required with metric='precomputed' (PCA init needs raw features)
        tsne = TSNE(n_components=2, perplexity=perp, random_state=rs, metric='precomputed', init='random')
        data_tsne = tsne.fit_transform(dist_mat)
        df_tsne = pd.DataFrame(data_tsne, columns=['x_projected', 'y_projected'])
        sns.scatterplot(x='x_projected', y='y_projected', data=df_tsne)
        plt.title('t-SNE Plot with Perplexity Value %s and Random State %s' % (perp, rs))
        plt.show()
The t-SNE plot shows a clear clustering structure, and the suggested number of clusters is around 6.
tsne = TSNE(n_components=2, perplexity=30, random_state=77, metric='precomputed', init='random')
data_tsne = tsne.fit_transform(dist_mat)
df_tsne = pd.DataFrame(data_tsne, columns=['x_projected', 'y_projected'])
sns.scatterplot(x='x_projected', y='y_projected', data=df_tsne)
plt.title('t-SNE Plot with Perplexity Value %s and Random State %s' % (30, 77))
plt.show()
cost_list = []
for k in range(5, 13):
    print(k)
    kp = KPrototypes(n_clusters=k, random_state=101)
    fit_clusters = kp.fit_predict(df_go_sample, categorical=[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
    cost_list.append(kp.cost_)
5
6
7
8
9
10
11
12
plt.plot(range(5,13),cost_list)
plt.xlabel('value of k')
plt.ylabel('cost value')
Text(0, 0.5, 'cost value')
The elbow plot of cost values shows no clear bend, so it doesn't support a clear clustering structure — which contradicts the t-SNE plot. I'll trust the t-SNE plot, which does show clear structure, and check whether the clustering results align with it.
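One way to make the elbow judgment less subjective is a simple curvature heuristic: pick the k where the second difference of the cost curve is largest. A sketch in plain NumPy — the cost values below are hypothetical, standing in for `cost_list`:

```python
import numpy as np

ks = np.arange(5, 13)
# Hypothetical, monotonically decreasing cost curve (stand-in for cost_list).
costs = np.array([9800, 9100, 8600, 8250, 8000, 7820, 7700, 7610], dtype=float)

# Second difference of the costs: a large positive value marks a sharp bend.
# np.diff(..., n=2) shortens the array by 2, so it aligns with ks[1:-1].
second_diff = np.diff(costs, n=2)
elbow_k = ks[1:-1][np.argmax(second_diff)]
print(elbow_k)
```

If the second differences are all small and similar, as the flat elbow plot here suggests, no k stands out — which is itself useful evidence that the cost curve alone can't pick k.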
k_prototype_ass = []
for k in range(2, 9):
    kp = KPrototypes(n_clusters=k, random_state=101)
    fit_clusters = kp.fit_predict(df_go_sample, categorical=[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
    df_tsne['predicted_cluster'] = fit_clusters
    k_prototype_ass.append(silhouette_score(X=dist_mat, labels=fit_clusters, metric='precomputed'))
    sns.scatterplot(x='x_projected', y='y_projected', hue='predicted_cluster', palette=sns.color_palette("husl", k), data=df_tsne)
    plt.title('t-SNE Plot with Perplexity Value {} and Random State {}, while k = {}'.format(30, 77, k))
    plt.show()
plt.plot(range(2,9), k_prototype_ass)
plt.xlabel('value of k')
plt.ylabel('average silhouette score')
plt.show()
From the t-SNE plot and the average-silhouette-score plot, we can see that k=4 both fits the t-SNE structure best and has a relatively high average silhouette score.
k = 4
kp = KPrototypes(n_clusters=k, random_state=101)
fit_clusters = kp.fit_predict(df_go_sample, categorical=[10,11,12,13,14,15,16,17,18,19,20])
df_tsne['predicted_cluster'] = fit_clusters
sns.scatterplot(x='x_projected',y='y_projected', hue='predicted_cluster', palette=sns.color_palette("husl", k), data=df_tsne)
plt.title('t-SNE Plot with Perplexity Value {} and Random State {}, while k = {}'.format(30, 77, k))
plt.show()
df_tsne['predicted_cluster'].value_counts()
0    3383
2    1081
1     357
3     179
Name: predicted_cluster, dtype: int64
def show_silhouette_plots_for_dist(X, cluster_labels, metric='euclidean'):
    # This package allows us to use "color maps" in our visualizations
    import matplotlib.cm as cm
    from sklearn.metrics import silhouette_samples
    # How many clusters in your clustering?
    n_clusters = len(np.unique(cluster_labels))
    # Create a single silhouette plot
    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(18, 20)
    # The silhouette coefficient can range from -1 to 1, but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters. Pass the metric through so a precomputed distance matrix is honored.
    silhouette_avg = silhouette_score(X, cluster_labels, metric=metric)
    # Compute the silhouette scores for each sample, with the same metric
    sample_silhouette_values = silhouette_samples(X, cluster_labels, metric=metric)
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.show()
    return
show_silhouette_plots_for_dist(X=dist_mat, cluster_labels=df_tsne['predicted_cluster'], metric='precomputed')  # dist_mat is a Gower distance matrix, so use the precomputed metric
silhouette_score(X = dist_mat, labels = df_tsne['predicted_cluster'], metric = 'precomputed')
0.24294731
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import squareform
for link in ['single', 'complete', 'average']:
    avg_ss = []
    for k in range(2, 13):
        # First, designate the HAC linkage function and the number of clusters
        # to extract from the resulting dendrogram
        hac = AgglomerativeClustering(n_clusters=k, metric='precomputed', linkage=link)  # affinity= in scikit-learn < 1.2
        Y_pred = hac.fit_predict(dist_mat)
        avg_ss.append(silhouette_score(X=dist_mat, labels=Y_pred, metric='precomputed'))
    plt.plot(range(2, 13), avg_ss)
    plt.title('Average Silhouette Score with HAC and %s Linkage' % link)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Average Silhouette Score of Clustering')
    plt.show()
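The `dendrogram` and `squareform` imports above are otherwise unused; a dendrogram makes the merge heights behind these silhouette curves visible and can sanity-check the chosen cut. Note that SciPy's `linkage` expects a condensed distance vector, so a square matrix like `dist_mat` must be condensed with `squareform` first. A sketch on toy distances standing in for the 5000-row Gower matrix:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch; unnecessary in a notebook
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy square distance matrix (stand-in for the Gower dist_mat).
rng = np.random.default_rng(0)
square_dist = squareform(pdist(rng.random((30, 2))))

# linkage() wants the condensed upper-triangle form, not the square matrix.
condensed = squareform(square_dist, checks=False)
Z = linkage(condensed, method='average')  # same linkage as the HAC choice below

fig, ax = plt.subplots(figsize=(8, 4))
dendrogram(Z, ax=ax)
ax.set_title('Average-linkage dendrogram (toy distances)')
plt.show()
```

Cutting the tree where the vertical merge distances jump gives an independent read on k, complementing the silhouette curves.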
for link in ['complete', 'average']:
    for k in range(4, 10):
        # Clustering from dendrogram with k clusters
        hac = AgglomerativeClustering(n_clusters=k, metric='precomputed', linkage=link)  # affinity= in scikit-learn < 1.2
        df_tsne['predicted_cluster'] = hac.fit_predict(dist_mat)
        # Map the resulting cluster labels onto our chosen t-SNE plot
        sns.scatterplot(x='x_projected', y='y_projected', hue='predicted_cluster', palette=sns.color_palette("husl", k), data=df_tsne)
        plt.title('t-SNE Plot with {} Linkage Clustering with k={} Clusters'.format(link, k))
        plt.legend(bbox_to_anchor=(1, 1))
        plt.show()
It seems that the most appropriate choice is average linkage with k=5.
hac = AgglomerativeClustering(n_clusters=5, metric='precomputed', linkage='average')  # affinity= in scikit-learn < 1.2
df_tsne['predicted_cluster'] = hac.fit_predict(dist_mat)
sns.scatterplot(x='x_projected',y='y_projected', hue='predicted_cluster', palette=sns.color_palette("husl", 5), data=df_tsne)
plt.title('t-SNE Plot with {} Linkage Clustering with k={} Clusters'.format('average',5))
plt.legend(bbox_to_anchor=(1,1))
plt.show()
silhouette_score(X=dist_mat, labels=df_tsne['predicted_cluster'], metric='precomputed')
0.25204557
show_silhouette_plots_for_dist(dist_mat,df_tsne['predicted_cluster'], metric='precomputed')
Since we are comparing algorithms run on different subsets of the features (DBSCAN on the numerical features, k-prototypes and hierarchical clustering on the mixed features), it's not entirely fair to compare average silhouette scores directly. Roughly speaking, though, the average silhouette scores are not too different.
Judging from the results, DBSCAN suggests 8 clusters, but a fair proportion of the data points are classified as outliers.
K-prototypes suggests 4 clusters,
while hierarchical clustering suggests average linkage with 5 clusters.
Considering the large size of the data, DBSCAN is not only more efficient and easier to interpret, it also performed better at identifying not just the major cluster but also the smaller clusters, which in business often generate the most revenue. Its only drawback is the fair proportion of outliers excluded from the clustering result. K-prototypes and hierarchical clustering both use Gower's distance and hence demand much more computation time and memory. Since all three models propose different numbers of clusters, we can also bring in business insight and field experience to select the appropriate model based on the results.
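To quantify how much two of these clusterings agree, beyond overlaying them on the t-SNE plot, the adjusted Rand index compares two label vectors on the same rows while ignoring label permutation. A sketch on synthetic labels — the vectors below merely stand in for, say, the k-prototypes (k=4) and average-linkage (k=5) assignments on `df_go_sample`:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
labels_a = rng.integers(0, 4, size=200)   # stand-in for one clustering
labels_b = labels_a.copy()
labels_b[:40] = rng.integers(0, 5, size=40)  # perturb 20% of the assignments

# ARI: 1.0 means identical partitions, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(labels_a, labels_b)
print(round(ari, 3))
```

High agreement between two methods with different inductive biases would strengthen the case for that cluster structure; low agreement would support leaning on business insight, as suggested above.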